Go to project home

Description

Project

RNA-seq data from 3 immune cell types in 4 donors

Analysis

This is a demo.

STAR options

STAR was run using the following options:

Summary statistics

Table 1. Summary statistics across all libraries, including the total number of sequence reads, the percentage of uniquely mapped reads, etc.
Statistic                              Min.   1st Qu.  Median  Mean     3rd Qu.  Max.
Total input, million reads             30.74  32.7000  34.480  36.3400  40.3900  44.52
Alignment rate (%), unique mapping     80.45  82.9800  83.600  84.1800  85.3400  88.79
Alignment rate (%), unique + multiple  92.64  93.3500  93.640  93.6200  93.8700  94.52
Mismatch rate (%)                      0.27   0.2800   0.295   0.2992   0.3200   0.33
Deletion rate (%)                      0.02   0.0200   0.020   0.0200   0.0200   0.02
Insertion rate (%)                     0.01   0.0100   0.010   0.0100   0.0100   0.01
Too many loci (%)                      0.05   0.0675   0.110   0.1033   0.1325   0.15
Too many mismatch (%)                  0.09   0.1000   0.110   0.1125   0.1200   0.14
Too short (%)                          5.23   5.7850   6.055   6.0860   6.4050   7.10
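
The column layout of Table 1 (minimum, quartiles, median, mean, maximum) resembles a standard six-number summary. As a minimal sketch of how one row could be derived from per-library values, the snippet below assumes that quartiles are obtained by linear interpolation between order statistics; this report does not state how its quartiles were computed. The function name and the input values are hypothetical and are not taken from this data set.

```python
from statistics import mean, median, quantiles


def six_number_summary(values):
    """Return min, 1st quartile, median, mean, 3rd quartile and max."""
    # method="inclusive" interpolates linearly between order statistics,
    # treating the sample minimum and maximum as the 0th and 100th percentiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median(values),
        "Mean": mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical per-library unique alignment rates (%), for illustration only.
rates = [80.0, 82.0, 84.0, 86.0, 88.0]
summary = six_number_summary(rates)
for name, value in summary.items():
    print(f"{name:8s} {value:.2f}")
```

Each row of the table would correspond to one such call on the per-library values of that statistic.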

Alignment rate

In most RNA-seq data sets, the percentage of total input reads that can be aligned to the reference genome/transcriptome typically ranges between 50% and 90%. The alignment rate is an important quality index of an RNA-seq library and of high-throughput sequencing in general. However, it also depends heavily on the experimental material and protocol, so it is hard to set a predefined cutoff for a "high" alignment rate that applies to all data sets. The consistency of alignment rates between samples of the same data set is at least equally important. Inconsistent alignment rates are usually the consequence of systematic bias introduced somewhere in the experimental procedure. Such bias adds unwanted between-sample variance to the data and can have a profound impact on statistical analyses such as differential gene expression. Therefore, this analysis focuses on whether any libraries have much lower alignment rates than the others.

The ratio of unique to multiple alignments is a similar index of data quality. A high percentage of multiply aligned reads might indicate low sequence complexity, a higher sequencing error rate, or other issues. This analysis also evaluates the consistency of unique vs. multiple alignment between samples.

Figure 1. The global alignment rate (left) and the rate of unique vs. multiple alignment (right). Each spot represents an RNA-seq library and is colored by its number of sigma. For each library, a linear model is fitted to all the other libraries, and the value of sigma (the residual standard error) is obtained from that model. The number of sigma is then calculated by dividing the difference between the observed and predicted values of that library by sigma.
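
The leave-one-out calculation described in the caption of Figure 1 can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the figure: it assumes a simple linear regression of aligned reads on total reads (the caption does not name the variables), the function names are hypothetical, and the read counts are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b, sigma)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Residual standard error with n - 2 degrees of freedom.
    sigma = math.sqrt(rss / (n - 2))
    return a, b, sigma


def number_of_sigma(xs, ys):
    """For each library, fit on all the others and scale its residual by sigma."""
    scores = []
    for i in range(len(xs)):
        other_x = xs[:i] + xs[i + 1:]
        other_y = ys[:i] + ys[i + 1:]
        a, b, sigma = fit_line(other_x, other_y)
        predicted = a + b * xs[i]
        scores.append((ys[i] - predicted) / sigma)
    return scores


# Hypothetical total and aligned read counts (millions), for illustration only.
total = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0]
aligned = [27.1, 28.7, 30.7, 32.3, 30.0, 36.1]
scores = number_of_sigma(total, aligned)
for t, s in zip(total, scores):
    print(f"total={t:5.1f}  number of sigma={s:8.2f}")
```

Because each library is excluded from the model used to judge it, a single outlying library (the fifth one in this made-up input) cannot inflate its own sigma and therefore stands out clearly.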

Non-canonical splice sites

An important aspect of processing RNA-seq data is aligning sequence reads across splice sites, called gap alignment. Most commonly, STAR performs gap alignment first by using the known splice sites from the reference transcriptome and then by detecting novel splice sites in the reference genome. Most splice sites have canonical donor/acceptor bases, such as GT/AG. Non-canonical splice sites have been observed, but they are relatively rare and often suggest false-positive alignments.
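
As a rough sketch of how a non-canonical fraction could be estimated for one library, the snippet below reads lines in the format of STAR's `SJ.out.tab` junction file, where column 5 encodes the intron motif (0 for non-canonical) and column 7 gives the number of uniquely mapped reads crossing the junction. This is an assumption about one possible approach; the report does not say how its percentage was derived, the function name is hypothetical, and the junction lines below are made up.

```python
def noncanonical_percent(sj_lines):
    """Percent of junction-crossing unique reads at non-canonical junctions."""
    total = 0
    noncanonical = 0
    for line in sj_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        motif = int(fields[4])         # 0 = non-canonical, 1-6 = canonical motifs
        unique_reads = int(fields[6])  # uniquely mapped reads crossing the junction
        total += unique_reads
        if motif == 0:
            noncanonical += unique_reads
    # Note: a read spanning several junctions is counted once per junction.
    return 100.0 * noncanonical / total if total else 0.0


# Hypothetical junction records, for illustration only.
example = [
    "chr1\t14830\t14969\t2\t2\t1\t95\t10\t70",
    "chr1\t15039\t15795\t2\t2\t1\t100\t5\t72",
    "chr1\t17056\t17232\t0\t0\t0\t5\t0\t30",
]
percent = noncanonical_percent(example)
print(f"{percent:.2f}% of junction-crossing reads are non-canonical")
```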

Figure 2. The total number of gap-aligned reads and the number of gap-aligned reads with non-canonical splice sites are fitted to linear models as in Figure 1. Averaged over all samples in this data set, 1.128% of all gap-aligned reads have non-canonical splice sites.

Figure 3. Distribution of insertion/deletion/mismatch frequency in all samples.

Figure 4. Distribution of the frequency of poorly aligned reads, by reason. The frequency is relative to all mapped reads in the first plot and relative to all unmapped reads in the others.